Keywords, k-NN and Neural Networks: a Support for Hierarchical Categorization of Texts in Brazilian Portuguese

نویسندگان

  • Susana Azeredo
  • Silvia Moraes
  • Vera Lúcia Strube de Lima
چکیده

A frequent problem in automatic categorization applications involving Portuguese language is the absence of large corpora of previously classified documents, which permit the validation of experiments carried out. Generally, the available corpora are not classified or, when they are, they contain a very reduced number of documents. The general goal of this study is to contribute to the development of applications which aim at text categorization for Brazilian Portuguese. Specifically, we point out that keywords selection associated with neural networks can improve results in the categorization of Brazilian Portuguese texts. The corpus is composed of 30 thousand texts from the Folha de São Paulo newspaper, organized in 29 sections. In the process of categorization, the k-Nearest Neighbor (k-NN) algorithm and the Multilayer Perceptron neural networks trained with the backpropagation algorithm are used. It is also part of our study to test the identification of keywords parting from the log-likelihood statistical measure and to use them as features in the categorization process. The results clearly show that the precision is better when using neural networks than when using the k-NN.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

‘Minor’ Languages, ‘Broken’ Translations: On Brazilian Reworkings of an Albanian Novel

This essay approaches the challenges of global translation in the 21st century from what might still be considered a somewhat uncommon example: a direct translation of Ismail Kadaré's 1978 novel Prill e thyër (Broken April) from the original Albanian into Brazilian Portuguese in 2001. Not only does it examine and compare lexical elements in the source and target texts and the usage of translato...

متن کامل

Application of Wavelet Neural Networks for Improving of Ionospheric Tomography Reconstruction over Iran

In this paper, a new method of ionospheric tomography is developed and evaluated based on the neural networks (NN). This new method is named ITNN. In this method, wavelet neural network (WNN) with particle swarm optimization (PSO) training algorithm is used to solve some of the ionospheric tomography problems. The results of ITNN method are compared with the residual minimization training neura...

متن کامل

Prediction of true critical temperature and pressure of binary hydrocarbon mixtures: A Comparison between the artificial neural networks and the support vector machine

Two main objectives have been considered in this paper: providing a good model to predict the critical temperature and pressure of binary hydrocarbon mixtures, and comparing the efficiency of the artificial neural network algorithms and the support vector regression as two commonly used soft computing methods. In order to have a fair comparison and to achieve the highest efficiency, a comprehen...

متن کامل

Construction of Neural Networks that Do Not Have Critical Points Based on Hierarchical Structure

a critical point is a point at which the derivatives of an error function are all zero. It has been shown in the literature that critical points caused by the hierarchical structure of a realvalued neural network (NN) can be local minima or saddle points, although most critical points caused by the hierarchical structure are saddle points in the case of complex-valued neural networks. Several s...

متن کامل

Performance Analysis of Naiotave Bayes Classification, Support Vector Machines and Neural Networks for Spam Categorization

Spam mail recognition is a new growing field which brings together the topic of natural language processing and machine learning as it is in essence a two class classification of natural language texts. An important feature of spam recognition is that it is a cost-sensitive classification: misclassification of a non-spam mail as spam is generally a more severe error than misclassifying a spam m...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2008